Patent abstract:
The method includes the steps of: acquiring (110) an acoustic guide signal g(t) corresponding to a reproduction of the specific contribution alone; determining (120, 130, 140) a parametric model of a specific signal (V^y_p) corresponding to the specific contribution alone, taking into account a fundamental-frequency correction for each time frame; determining (150) a parametric model of a background sound signal (V^z_p) corresponding to the background contribution alone; estimating (170, 190) an intermediate specific signal (V^y_i) and an intermediate background sound signal (V^z_i), by adjusting the parameters of the models and by using the mixing acoustic signal x(t); and filtering (200) the mixing acoustic signal using the intermediate specific signal and the intermediate background sound signal, to obtain a specific acoustic signal y(t) and a background acoustic signal z(t).
Publication number: FR3013885A1
Application number: FR1361792
Filing date: 2013-11-28
Publication date: 2015-05-29
Inventor: Romain Hennequin
Applicant: Audionamix
IPC main classification:
Patent description:

[0001] The present invention relates to the field of methods and systems for separating a specific contribution and a background-sound contribution in a mixing acoustic signal, and in particular to a method for separating a dialogue contribution and a background-sound contribution in a mixing acoustic signal.

A soundtrack of a movie or television series includes dialogue superimposed on special sound effects and/or music. For an old film, the soundtrack is a mixture resulting from the superposition of at least these two contributions, and these two contributions are generally not accessible separately. Therefore, if one wishes to broadcast this film in a version other than the original version, it is necessary to separate the dialogue contribution from the background-sound contribution in the original soundtrack before a dubbing of the dialogue in a destination language can be added to the background sound thus isolated, to obtain a new soundtrack.

Similarly, the producers of a film may have obtained the rights to broadcast the music only for a given territory or for a given period. It is impossible to broadcast a film whose soundtrack does not respect these contractual conditions. It is then necessary to separate the dialogue contribution from the background-sound contribution before a new music track can be added to the isolated original dialogue, to obtain a new soundtrack.

There is therefore a need for a method for separating a dialogue contribution and a background-sound contribution in a sound signal corresponding to the mixture of these two contributions, in order to obtain, on the one hand, a dialogue sound signal alone and, on the other hand, a background sound signal alone.

In the general area of audio signal processing, source separation has been an important topic of the last decade. In the prior art, the problem of source separation was initially addressed in a context of blind source separation. In particular, non-negative matrix factorization (the NMF method) is used. For example, the paper by T. Virtanen, "Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 3, pp. 1066-1074, March 2007, discloses such an NMF method. However, one of the main drawbacks of the NMF method lies in the difficulty of grouping the factored elements and associating them with a particular source.
[0002] Recently, it has been proposed to add additional information upstream of the NMF method in order to facilitate and improve the separation. In the particular field of the separation of musical sources (that is, of a musical instrument within an orchestra), for example, a method has been proposed in which the different spectral shapes of each instrument are learned from isolated sounds. The spectral shapes obtained are then used as additional information to separate the different sources in the mixture. As another example, according to another method, a MIDI file is used as additional information to facilitate the separation of the instruments in a piece of music.
[0003] In the particular field of separating speech from a background, it has been proposed to use a guidance sound signal, mimicking the dialogue contribution of the mixing signal, to guide the separation by providing additional information. More particularly, the guidance signal corresponds to a recording of the voice of a speaker dubbing the target dialogue contribution to be separated. Such an approach has been proposed in the paper by P. Smaragdis and G. Mysore, "Separation by humming: user-guided sound extraction from monophonic mixtures," in Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, October 2009. This document uses a method based on Probabilistic Latent Component Analysis (PLCA): a guidance signal that mimics the dialogue contribution to be extracted from the mixing signal serves as a prior for the PLCA analysis. However, this method of the state of the art suffers from a lack of robustness with respect to changes in the pitch of the fundamental frequency between the sounds emitted by the speaker(s) of the dialogue contribution in the mixing signal and the sounds emitted by the speaker(s) dubbing the dialogue contribution in the guidance signal. This height of the sound is referred to as the "pitch". The lack of robustness also comes from a high sensitivity to any temporal misalignment, even slight, between the guide signal and the dialogue contribution in the mixing signal.
[0004] Finally, the lack of robustness also comes from a sensitivity to differences in equalization between the guide signal and the mixing signal. The document by L. Le Magoarou et al., "Text-informed audio source separation using nonnegative matrix partial co-factorization," in IEEE International Workshop on Machine Learning for Signal Processing, Southampton, UK, September 2013, discloses an estimation algorithm based on a source-filter voice model for the dialogue contribution of the mixing signal and for the guidance signal. This algorithm makes it possible to take into account the differences in synchronization and in overall equalization between the dialogue contribution of the mixing signal and the guidance signal. However, this algorithm, although robust to a change of fundamental frequency of the guidance signal, has no parameter related to the fundamental frequency, since the fundamental frequency is not a variable of the source-filter voice model used. This algorithm therefore does not exploit the pitch information contained in the guidance signal. The object of the invention is to overcome these problems by proposing an improved informed separation method which automatically exploits the pitch differences between the guide signal and the dialogue contribution in the mixing signal.
[0005] The subject of the invention is therefore a method of separating, in an acoustic mixing signal, a specific contribution and a background sound contribution, characterized in that it comprises the steps of:
- acquiring an acoustic guide signal corresponding to a reproduction of the specific contribution alone;
- determining a parametric model of a specific signal corresponding to the specific contribution alone, taking into account a fundamental-frequency correction for each time frame;
- determining a parametric model of a background sound signal corresponding to the background sound contribution alone;
- estimating an intermediate specific signal and an intermediate background sound signal, by adjusting the parameters of the models and by using the acoustic mixing signal;
- filtering the acoustic mixing signal by using the intermediate specific signal and the intermediate background sound signal, to obtain a specific acoustic signal and an acoustic background sound signal.
[0006] According to other embodiments, the separation method comprises one or more of the following features, taken in isolation or in any technically possible combination:
- the method comprises an initial step of transforming a temporal acoustic signal into a time-frequency representation, and a final step of transforming a time-frequency representation into a temporal acoustic signal, inverse to the initial transformation step, the determination, modeling, estimation and filtering steps being implemented in the frequency domain on time-frequency representations;
- the transformation of a temporal acoustic signal into a time-frequency representation is a logarithmic frequency scale transformation, in particular a constant-Q transform;
- the step of determining a parametric model of a background sound signal is based on a non-negative matrix decomposition;
- the step of determining a parametric model of a specific signal also makes it possible to take into account a correction of a time shift between the guide signal and the specific contribution in the mixing spectrogram;
- the step of determining a parametric model also makes it possible to take into account an equalization correction between the guide signal and the mixing signal;
- the estimation step is based on the minimization of a cost function;
- the cost function uses a divergence, in particular the Itakura-Saito divergence;
- the step of determining a pitch-correction parametric model of a spectrogram V^g of the guide signal leads to a parametric guide-signal spectrogram V^g_shifted of the form:
V^g_shifted = Σ_φ (↓_φ V^g) diag(P_φ,:)
where ↓_φ V^g corresponds to an offset of the spectrogram matrix V^g of the guide signal by φ time/frequency bins downwards, diag(P_φ,:) is the diagonal matrix whose diagonal consists of the components of the φ-th row of the matrix P representative of a pitch shift, and Σ_φ is the summation over all the values of φ;
- the step of estimating a parametric model with pitch correction and synchronization correction of a spectrogram of the guide signal V^g leads to a parametric guide-signal spectrogram V^g_sync of the form:
V^g_sync = V^g_shifted S
where the matrix S is representative of a synchronization and is such that there exists an integer w such that, for any pair of frames (t1, t2), if |t1 − t2| > w then s_{t1,t2} = 0;
- the step of estimating a parametric model with pitch correction, synchronization correction and equalization correction of a spectrogram of the guide signal V^g leads to a parametric guide-signal spectrogram V^g_p of the form:
V^g_p = diag(E) (Σ_φ (↓_φ V^g) diag(P_φ,:)) S
where diag(E) is a diagonal matrix representative of an equalization whose diagonal consists of the components of the vector E;
- the estimation step is iterative and, denoting by V̂ the overall model of the mixture spectrogram, implements the following update rules:
P_φ,: ← P_φ,: ⊙ [E^T ((↓_φ V^g) ⊙ ((V^x ⊙ V̂.^(−2)) S^T))] / [E^T ((↓_φ V^g) ⊙ (V̂.^(−1) S^T))]
for the pitch correction,
S ← S ⊙ [(Σ_φ diag(E) (↓_φ V^g) diag(P_φ,:))^T (V^x ⊙ V̂.^(−2))] / [(Σ_φ diag(E) (↓_φ V^g) diag(P_φ,:))^T V̂.^(−1)]
for the synchronization correction, when taken into account, and
E ← E ⊙ [(((Σ_φ (↓_φ V^g) diag(P_φ,:)) S) ⊙ V^x ⊙ V̂.^(−2)) 1_T] / [(((Σ_φ (↓_φ V^g) diag(P_φ,:)) S) ⊙ V̂.^(−1)) 1_T]
for the equalization correction, when taken into account, where ⊙ is the operator corresponding to the element-wise product between matrices (or vectors), the fraction bar denotes the element-wise division, .^α is the operator corresponding to the element-wise exponentiation of a matrix by a scalar, (.)^T is the transpose of a matrix, and 1_T is a T x 1 vector whose elements are all equal to 1;
- the method comprises a first estimation step and a tracking step, the tracking step consisting in optimizing a value of each parameter of the parametric models obtained at the output of the first estimation step;
- the method comprises a second estimation step, the optimized value obtained from the tracking step being taken as the initial value of the corresponding parameter in the second estimation step;
- the filtering step implements a Wiener filtering.
The invention also relates to a system for the implementation of a separation method as defined above, characterized in that it comprises: a means for acquiring a guide signal; a module for determining a parametric model of a dialogue signal; a module for determining a parametric model of a background sound signal; a module for estimating an intermediate dialogue signal and an intermediate background sound signal from a mixing signal; and a filtering module adapted to generate a dialogue signal and a background sound signal from the mixing signal and the intermediate dialogue and background sound signals.
The invention will be better understood on reading the following description of a particular embodiment, given purely as an illustrative and non-limiting example, and with reference to the accompanying drawings, in which:
- Figure 1 is a block representation of the various steps of the separation method according to the invention;
- Figure 2 is a schematic view of a system for carrying out the method of Figure 1;
- Figures 3 and 4 correspond to graphs resulting from tests carried out to compare, according to known normative criteria, the results of the implementation of the method by the system of Figure 2 with those of methods of the prior art.
[0007] Referring to Figure 1, the separation method 100 uses a temporal mixing acoustic signal x(t) and a temporal guide acoustic signal g(t) to deliver a dialogue acoustic signal y(t) and a background sound acoustic signal z(t). The signals considered are all acoustic signals, so that the "acoustic" qualifier will be omitted in what follows; these signals are time signals and depend on the time t. The mixing acoustic signal is a source soundtrack, or at least an excerpt from a soundtrack. The mixing signal x(t) comprises a first contribution and a second contribution.
[0008] The first contribution is a dialogue consisting of words uttered by one or more original speakers. The second contribution corresponds to what is here called background sound and includes special sound effects, music, etc.
[0009] The acoustic guide signal g(t) corresponds to the first contribution, but pronounced by a user who dubs the dialogue contribution of the mixing signal x(t). The acoustic dialogue signal y(t) corresponds to the dialogue contribution alone, isolated from the rest of the mixing signal x(t), and the acoustic background signal z(t) corresponds to the background sound contribution alone, isolated from the remainder of the mixing signal x(t). The first step 110 of the method 100 thus consists in acquiring the guide signal g(t) by recording a speaker dubbing the dialogue contribution of the mixing signal x(t).
[0010] In step 115, a log-frequency spectrogram of the acquired guidance signal g(t) is calculated. This spectrogram, V^g, is defined as the squared modulus of the constant-Q transform, or CQT, of the signal g(t). In order to avoid any confusion, it is preferable to use different terms to distinguish the non-negative matrices (squared modulus of the CQT) from the complex matrices (obtained by CQT, or supposed to model such a CQT). Thus, in what follows, the term "spectrogram" denotes the non-negative matrices and the term "constant-Q transform" the complex matrices. For step 115, an algorithm known to those skilled in the art is used, which makes it possible to pass from the time domain to the frequency domain in such a way that the central frequencies fc of the frequency sampling steps ("bins") are spaced from one another in a geometric progression and that the quality factors Q of the frequency sampling steps are equal to one another. The quality factor Q of a frequency sampling step is given by Q = fc / Δf, where fc is the center frequency of the frequency sampling step and Δf is its width. This representation has the property that a modification of the pitch of a sound, characterized by a fundamental frequency and a plurality of harmonics of this fundamental frequency, results in a simple translation along the frequency axis of the spectrogram, or at least of a frame of this spectrogram. This property is fundamental for the step of correcting the pitch of the guidance signal, which will be presented below. It should be noted that a frame corresponds to a temporal sampling step of a signal and therefore of the corresponding spectrogram.
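Purely as an illustration, a minimal sketch of such a log-frequency spectrogram computation is given below; it assumes the librosa library and the CQT settings given later in the description (fmin = 40 Hz, fmax = 16000 Hz, 48 bins per octave), the implementation actually used being left unspecified by the text:

    import numpy as np
    import librosa

    def log_frequency_spectrogram(signal, sr, fmin=40.0, fmax=16000.0, bins_per_octave=48):
        """Squared modulus of the constant-Q transform (non-negative spectrogram V)."""
        n_bins = int(np.ceil(bins_per_octave * np.log2(fmax / fmin)))
        # sr must be at least 2 * fmax for the highest CQT bins to be representable
        cqt = librosa.cqt(signal, sr=sr, fmin=fmin, n_bins=n_bins,
                          bins_per_octave=bins_per_octave)
        return np.abs(cqt) ** 2   # F x T matrix, one column per time frame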
[0011] In step 116, a spectrogram V^x of the mixing signal x(t) is calculated in the same way. The spectrogram of the guidance signal V^g is an F x T matrix. The spectrogram of the mixing signal V^x is an F x T matrix. T represents the total number of frames which subdivide the duration of the mixing signal x(t) and of the guide signal g(t). The guide signal g(t) and the mixing signal x(t) have the same duration. If this is not the case, it is easy to perform a temporal modification directly on the matrix V^g by means of a synchronization matrix S (which will be presented in detail hereinafter) chosen to have a size T' x T, where T' is the temporal length of V^g and T that of the matrix V^x. The spectrogram of the mixing signal V^x is modeled as the sum of the spectrogram of the dialogue signal, denoted V̂^y, and the spectrogram of the background sound signal, denoted V̂^z. This modeling is usual in the context of non-negative matrix factorization methods.
[0012] It should be noted that â denotes a quantity which is an estimate of the quantity a. Thus, in the following steps of the method 100, the aim is to estimate the two output spectrograms whose sum is equal to the spectrogram of the mixture:

V^x ≈ V̂^y + V̂^z   (1)

The guide signal g(t) is not equal to the dialogue signal y(t). Indeed, between the guide signal g(t) and the dialogue contribution in the mixing signal x(t), there are differences which it is necessary to model in order to take them into account in the separation. A parametric spectrogram V^y_p makes it possible to model the differences between the spectrogram of the guidance signal V^g and the dialogue contribution in the spectrogram of the mixing signal V^x. The determination of the parameters of the parametric spectrogram V^y_p leads to the spectrogram of the estimated dialogue signal V̂^y of equation (1). The parametric spectrogram V^y_p is determined from the guidance spectrogram V^g so as to allow three different types of adaptation:
- a pitch-shift operator is first applied in order to take into account, frame by frame, the difference in pitch of the sound between the guide signal and the dialogue contribution in the mixing signal;
- a synchronization operator is then applied to take into account a slight time shift between the frames of the guidance signal and those of the dialogue contribution in the mixing signal which correspond to the same word or phoneme of the dialogue;
- an equalization operator is finally applied to allow a global adaptation taking into account global spectral differences, or equalization differences, between the guidance signal and the mixing signal.
When applying these three corrections, the corresponding parameters are constrained to be non-negative.

More precisely, in step 120, the pitch-shift operator is applied to the spectrogram V^g. It is a Φ x T matrix, called P, applying a vertical offset to each time frame of the spectrogram of the guidance signal V^g. The spectrograms being calculated with a CQT transformation, a vertical offset of a frame corresponds to a modification of the pitch, as specified above. The operation can be written:

V^g_shifted = Σ_φ (↓_φ V^g) diag(P_φ,:)   (2)

where ↓_φ V^g corresponds to an offset of the spectrogram matrix V^g by φ time/frequency bins downwards (i.e. [↓_φ V^g]_{f,t} = [V^g]_{f−φ,t}), and diag(P_φ,:) is the diagonal matrix whose main diagonal consists of the components of the φ-th row of the matrix P. The pitch-shift operator P models a possible difference between the instantaneous pitch of the guide signal and that of the dialogue component of the mixing signal. In practice, only one pitch shift φ per frame t must be retained. For this, a selection procedure is applied, as described below.

In step 130, a synchronization operator, called matrix S, is applied. It is a T x T matrix allowing a temporal alignment between the spectrogram of the guidance signal and the dialogue component of the mixing signal: a time frame of the spectrogram of the mixing signal is modeled by a linear combination of neighboring frames of the (pitch-shifted) spectrogram of the guidance signal. This operation is expressed by the relation:

V^g_sync = V^g_shifted S   (3)

where S is a band matrix, that is to say that there exists an integer w such that, for any pair of frames (t1, t2), if |t1 − t2| > w then s_{t1,t2} = 0. The width w of the band of the matrix S corresponds to a misalignment tolerance between the frames.
A large width w allows a large tolerance, but at the cost of a poorer estimate of the parameters of the model. The width w is thus advantageously limited to a small number of time frames of the guide signal. The synchronization is advantageously optimized during a selection procedure which will be presented below.
[0013] In step 140, the equalization operator is applied. It is an F x 1 vector, called E, which acts as a global filter on the (shifted and synchronized) spectrogram of the guidance signal. Thus, the parametric spectrogram of the dialogue signal, V^y_p, is given by:

V^y_p = diag(E) (Σ_φ (↓_φ V^g) diag(P_φ,:)) S   (4)

where diag(E) is a diagonal matrix whose diagonal consists of the components of the vector E. In step 150, since no information is available on the content of the background sound signal z(t), a parametric spectrogram of the background signal V^z_p is derived from a standard NMF model.
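Purely as an illustration of equations (2) to (4), a minimal numpy sketch of the construction of the parametric dialogue spectrogram is given below; the array names are illustrative and only non-negative downward shifts φ are handled (upward shifts would be handled symmetrically):

    import numpy as np

    def shift_down(Vg, phi):
        """[shift_down(Vg, phi)]_{f,t} = Vg_{f-phi,t}; rows entering from the top are zero."""
        shifted = np.zeros_like(Vg)
        shifted[phi:, :] = Vg[:Vg.shape[0] - phi, :]
        return shifted

    def parametric_dialogue_spectrogram(Vg, P, S, E):
        """V^y_p = diag(E) . (sum_phi shift_down(Vg, phi) @ diag(P[phi, :])) @ S."""
        F, T = Vg.shape
        V_shifted = np.zeros((F, T))
        for phi in range(P.shape[0]):                      # equation (2)
            V_shifted += shift_down(Vg, phi) * P[phi, :]   # right-multiplying by diag(P[phi, :]) scales columns
        V_sync = V_shifted @ S                             # equation (3)
        return E[:, None] * V_sync                         # equation (4): diag(E) scales rows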
[0014] Thus, the spectrogram of the background sound signal is modeled parametrically by:

V^z_p = W H   (5)

where W is a non-negative F x R matrix and H a non-negative R x T matrix, with the constraint that R is much smaller than F and than T. The choice of R is important and depends on the application. The columns of W can be seen as elementary spectral models and H as an activation matrix of these elementary models over time.

In step 160, the method makes a first estimate of the parameters of the models V^y_p and V^z_p. In order to estimate the parameters of these spectrograms, a cost function C, based on an element-wise divergence d, is used:

C = D(V^x | V^y_p + V^z_p) = Σ_{f,t} d(V^x_{f,t} | [V^y_p + V^z_p]_{f,t})   (6)

In the presently contemplated embodiment, the Itakura-Saito divergence, well known to those skilled in the art, is used. It is written:

d(a|b) = a/b − log(a/b) − 1   (7)

The cost function C is minimized so as to determine the optimal value of each parameter. This minimization is carried out by iterations, with multiplicative update rules which are successively applied to each of the parameters of the spectrogram models: W, H, E, S and P. These update rules are for example derived by considering the gradient (i.e. the partial derivative) of the cost function C with respect to each parameter.
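To illustrate the form of such multiplicative rules, a minimal numpy sketch of the standard Itakura-Saito updates for the background parameters W and H is given below, under the full model of the mixture spectrogram (sum of the dialogue and background parametric spectrograms); it assumes dense numpy arrays with illustrative names, and the updates of E, S and P given below follow the same ratio-of-gradient-terms pattern:

    def update_background(Vx, Vy_p, W, H, eps=1e-12):
        """One multiplicative Itakura-Saito update of W and H, with V_hat = Vy_p + W @ H."""
        V_hat = Vy_p + W @ H
        W *= ((Vx * V_hat ** -2) @ H.T) / ((V_hat ** -1) @ H.T + eps)
        V_hat = Vy_p + W @ H                  # recompute the model with the updated W
        H *= (W.T @ (Vx * V_hat ** -2)) / (W.T @ (V_hat ** -1) + eps)
        return W, H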
[0015] More precisely, the gradient of the cost function with respect to the parameter considered is written in the form of a difference between two non-negative terms, and the corresponding update rule is a multiplication of the parameter considered by the ratio of these two terms. This ensures in particular that the parameters remain non-negative at each update and become constant if the gradient of the cost function with respect to the parameter considered tends towards zero. In this way, the parameters evolve towards a local minimum.

Denoting V̂ = V^y_p + V^z_p, the update rules for the parameters of the spectrogram model of the dialogue signal V^y_p are as follows:

E ← E ⊙ [(((Σ_φ (↓_φ V^g) diag(P_φ,:)) S) ⊙ V^x ⊙ V̂.^(−2)) 1_T] / [(((Σ_φ (↓_φ V^g) diag(P_φ,:)) S) ⊙ V̂.^(−1)) 1_T]   (8)

S ← S ⊙ [(Σ_φ diag(E) (↓_φ V^g) diag(P_φ,:))^T (V^x ⊙ V̂.^(−2))] / [(Σ_φ diag(E) (↓_φ V^g) diag(P_φ,:))^T V̂.^(−1)]   (9)

P_φ,: ← P_φ,: ⊙ [E^T ((↓_φ V^g) ⊙ ((V^x ⊙ V̂.^(−2)) S^T))] / [E^T ((↓_φ V^g) ⊙ (V̂.^(−1) S^T))]   (10)

where ⊙ is the operator corresponding to the element-wise product between matrices (or vectors), the fraction bar denotes the element-wise division, .^α is the operator corresponding to the element-wise exponentiation of a matrix by a scalar, (.)^T is the transpose of a matrix, and 1_T is a T x 1 vector whose elements are all equal to 1.

The update rules for W and H are the standard multiplicative update rules of an NMF method with a cost function based on the Itakura-Saito divergence. For example, the paper by C. Févotte et al., "Nonnegative matrix factorization with the Itakura-Saito divergence, with application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793-830, March 2009, describes such updates. For this first estimation step, all the parameters are initialized with non-negative values chosen randomly.

In step 170, the method enters a step of optimizing the parameters and, in particular, the parameters of the pitch-shift operator P. A frame of the dialogue spectrogram V^y_p is modeled (up to the equalization and synchronization operators) as a linear combination of pitch-shifted versions of the corresponding frame of the spectrogram V^g. To describe only small differences in pitch, a single pitch shift is advantageously retained per frame. The optimization step is therefore used to determine this unique value of the offset parameter for each frame.
[0016] To do this, a method of tracking the pitch shift across the matrix P is used. More precisely, in the embodiment currently envisaged, a so-called Viterbi tracking algorithm, known to those skilled in the art, is applied to the matrix P resulting from the first parameter estimation step 160. For example, the paper by J.-L. Durrieu et al., "An iterative approach to monaural musical mixture de-soloing," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, Taiwan, April 2009, pp. 105-108, describes such a tracking algorithm. Then, once an optimal pitch offset has been selected for each frame, the coefficients of the matrix P which do not correspond to this optimal pitch offset are set to 0. An optimized offset matrix is thus obtained. In practice, advantageously, a small margin around the optimal pitch offset is allowed. On the one hand, pitch offsets, although quantized in the present method, are actually continuous. On the other hand, the tracking algorithm can produce small errors. Thus, the coefficients of the matrix P are smoothed according to a predetermined law around the optimum value of the parameter. As a variant, it is also possible to optimize the synchronization matrix S by implementing another tracking procedure adapted to the optimization of the parameters of this operator. Then, in step 180, the method 100 performs a second estimation of the parameters of the models V^y_p and V^z_p.
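Purely as an illustration of the tracking step 170, a minimal sketch of a Viterbi-style selection of one pitch offset per frame is given below; the transition penalty lam, proportional to the size of the jump between consecutive offsets, is an assumption of this sketch, the exact cost used by the embodiment not being specified in the text:

    import numpy as np

    def track_pitch_offsets(P, lam=1.0):
        """Select one offset index per frame by dynamic programming on log P,
        penalizing jumps between consecutive frames by lam * |phi_t - phi_{t-1}|."""
        n_shifts, n_frames = P.shape
        log_p = np.log(P + 1e-12)
        shifts = np.arange(n_shifts)
        score = np.zeros((n_shifts, n_frames))
        back = np.zeros((n_shifts, n_frames), dtype=int)
        score[:, 0] = log_p[:, 0]
        for t in range(1, n_frames):
            # trans[i, j]: best score if frame t-1 uses offset i and frame t uses offset j
            trans = score[:, t - 1][:, None] - lam * np.abs(shifts[:, None] - shifts[None, :])
            back[:, t] = np.argmax(trans, axis=0)
            score[:, t] = trans[back[:, t], shifts] + log_p[:, t]
        path = np.zeros(n_frames, dtype=int)
        path[-1] = np.argmax(score[:, -1])
        for t in range(n_frames - 2, -1, -1):
            path[t] = back[path[t + 1], t + 1]
        return path   # path[t] is the retained offset index for frame t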
[0017] This second estimation is identical to the first estimation (step 160), except that the operators are initialized with the optimized operator(s) of step 170 (instead of a random initialization). It should be noted that, since the update rules are multiplicative, the coefficients of the matrix P (and possibly those of S) set to 0 will remain at 0 during the second estimation step. At the end of step 180, the final values of the various parameters make it possible to calculate the intermediate spectrograms V^y_i and V^z_i (step 190).

Finally, in step 200, the actual separation is carried out by means of a Wiener filtering using the constant-Q transform of the mixing signal and the intermediate spectrograms V^y_i and V^z_i. The constant-Q transforms of the dialogue signal and of the background sound signal are thus obtained. By a transformation inverse to that of steps 115 and 116, the dialogue output signal y(t) and the background sound output signal z(t) are obtained (steps 205, 206).

In the embodiments described here in detail, these acoustic signals are monophonic signals. In a variant, these signals are stereophonic. More generally, they are multichannel. The person skilled in the art knows how to adapt the processing presented for the case of monophonic signals to stereophonic or multichannel signals. In particular, an additional panning parameter can be used in the modeling of the dialogue signal from the guidance signal.

Figure 2 shows a system 10 for carrying out the method presented above. It comprises a central server 12 connected, via a communication network 14, for example the Internet, to a client computer 16. The client computer 16 executes an application allowing a user to select a mixing soundtrack, to listen to the selected soundtrack and to record a speaker dubbing the dialogue of the selected soundtrack. The mixing soundtrack can be selected through the Internet, for example from a database accessible online. The mixing soundtrack may also be selected from a recording medium belonging to the user and read by the client computer 16. The mixing signal x(t), corresponding to the selected soundtrack, and the guide signal g(t), corresponding to the recording made, are transmitted, via the Internet, to the central server 12.
[0018] The central server 12 comprises calculation means and storage means. The calculation means are suitable for executing a program whose instructions are stored in the storage means for the implementation of the method 100 from the mixing signal x(t) and the guidance signal g(t) received from the client computer 16. The server 12 thus comprises:
- a module 20 for calculating a log-frequency spectrogram from the guidance signal or the mixing signal;
- a first modeling module 30 for obtaining, from the spectrogram of the guidance signal, a parametric spectrogram of the dialogue signal, comprising a submodule 32 for applying a pitch-shift operator, a submodule 34 for applying a time synchronization operator, and a submodule 36 for applying an equalization operator;
- a second modeling module 40 for obtaining, from the spectrogram of the mixing signal, a parametric spectrogram of the background sound signal;
- a module 50 for estimating the parameters of the parametric spectrograms, taking into account the spectrogram of the mixing signal;
- an optimization module 60 comprising a module 62 for optimizing the parameters of the pitch-shift operator and a module 64 for optimizing the parameters of the synchronization operator;
- a module 70 for determining the spectrograms of the dialogue signal and of the background sound signal from the optimized parameters, the module 70 implementing a Wiener filtering; and
- a module 80 for calculating a signal from a spectrogram.
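As an illustration of the filtering carried out by module 70 in step 200, a minimal sketch of a Wiener-type soft mask applied to the complex constant-Q transform of the mixture is given below; the array names are illustrative, and the inverse transform performed afterwards by module 80 is not shown:

    import numpy as np

    def wiener_separate(X_cqt, Vy_i, Vz_i, eps=1e-12):
        """Split the complex CQT of the mixture with masks built from the intermediate spectrograms."""
        mask_y = Vy_i / (Vy_i + Vz_i + eps)
        Y_cqt = mask_y * X_cqt            # constant-Q transform of the dialogue estimate
        Z_cqt = (1.0 - mask_y) * X_cqt    # constant-Q transform of the background estimate
        return Y_cqt, Z_cqt               # inverted back to y(t) and z(t) by module 80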
[0019] Finally, the central server 12, after processing the signals transmitted to it and obtaining from them a dialogue signal y(t) and a background sound signal z(t), is able to communicate these two output signals to the client computer 16.

Comparative tests have been conducted to compare the results of the implementation of the present method with those of known methods:
- the first known method is a separation based on an NMF-type method including a source-filter voice model, without guidance information;
- the second known method is a separation informed by the data of a guide signal corresponding to the dialogue contribution and using a PLCA analysis;
- the third known method is similar to the first, but uses as guidance information a frame-by-frame annotation of the fundamental frequency (this annotation is done manually and is therefore tedious and expensive).

A database of soundtracks has been created. A soundtrack of this database results from superimposing a track containing only dialogue (in English) on a track containing only music and special effects. In this way, the contributions of each source in the corresponding mixing signal are precisely known. The database consists of ten soundtracks.
[0020] To obtain a guidance signal, each soundtrack was dubbed using the corresponding mixing signal as a time reference. All the dubbing was done by the same male native English speaker. The guidance signal obtained is used for the method according to the invention and for the second known method.
[0021] The spectrograms were calculated using a CQT transformation with a minimum frequency fmin = 40 Hz, a maximum frequency fmax = 16000 Hz, and 48 frequency sampling steps per octave. In order to quantify the results obtained with the various known methods and with the method according to the invention, standard indicators of the field of source separation have been calculated. These indicators are the signal-to-distortion ratio (SDR), the signal-to-artifact ratio (SAR) and the signal-to-interference ratio (SIR). The results are shown in Figure 3 for the dialogue signal and in Figure 4 for the background sound signal. In these figures, the first three columns represent the three known methods, the fourth the method according to the invention, and the fifth an ideal case where the dialogue and background tracks that were used to construct the mixing soundtrack are used directly at the input of the Wiener filtering step.
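These criteria can be computed, for example, with the bss_eval routines of the mir_eval library, as in the minimal sketch below; the text does not state which implementation was actually used:

    import numpy as np
    import mir_eval

    def evaluate_separation(y_ref, z_ref, y_est, z_est):
        """BSS-Eval SDR, SIR and SAR for the estimated dialogue and background signals."""
        references = np.vstack([y_ref, z_ref])   # true source signals, shape (2, n_samples)
        estimates = np.vstack([y_est, z_est])    # separated signals in the same order
        sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(references, estimates)
        return sdr, sir, sar                     # one value per source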
[0022] The results of the first method are significantly worse than those of any of the informed methods, which confirms the advantage of these methods. The second known method performs less well than the third known method and than the method according to the invention. On the other hand, these last two methods are not clearly distinguished using the standard indicators. The differences according to the SDR indicator are not significant. The results according to the SAR and SIR indicators give the advantage to the third known method for the extraction of the dialogue contribution, but give the advantage to the method according to the invention for the inverse task of suppressing the dialogue contribution (that is to say, extracting the background contribution). However, additional qualitative indicators lead to giving the advantage to the method according to the invention. Indeed, blind listening tests based on the MUSHRA protocol have been performed by inviting listeners to evaluate the dialogue signals obtained by means of the third known method and of the method according to the invention. On the criterion of "fitness for use", these listeners preferred the results obtained with the method according to the invention.
[0023] It should be emphasized that the method according to the invention performs the pitch correction automatically, without requiring the fundamental frequency to be annotated, unlike the third known method. Alternatively, other embodiments of a system for implementing the method according to the invention are conceivable.
[0024] The present embodiment illustrates the particular case of the separation of a dialogue from a mixing signal, by adapting the spectrogram of the provided voice guidance signal in pitch, and possibly in time and/or in equalization, upstream of the use of a non-negative matrix factorization method. However, the present method does not use a voice-specific model for the guidance signal. Consequently, the present method is in fact adapted to the separation, in a mixing acoustic signal, of any type of specific acoustic contribution for which the operator has an acoustic guide signal. Such a guidance signal is another recording, which may differ in pitch, time synchronization and overall equalization from the recording of the specific acoustic contribution in the mixing signal. The present invention makes it possible to model these differences in pitch, time synchronization and overall equalization and to compensate for them during the separation. Thus, instead of a voice, the specific acoustic contribution can be the sound of a particular instrument in a musical signal mixing several instruments. The part of this particular instrument is replayed and recorded to serve as a guide signal. Or, the specific acoustic contribution is a recording of the music alone which was used for the creation of the soundtrack of an old movie. Such a recording generally has small differences in pitch, speed and overall equalization with respect to the contribution of the music in the soundtrack of the movie, because all these signals were originally stored on analog media. This recording can be used as a guide signal in the present method, in order to recover the dialogues and the sound effects. Those skilled in the art will understand that the method of the document by L. Le Magoarou et al. is not suitable for these last two applications.
Claims:
Claims (16)
[0001]
1. A separation method (100), in a mixing acoustic signal (x(t)), of a specific contribution and of a background sound contribution, characterized in that it comprises the steps of:
- acquiring (110) an acoustic guide signal (g(t)) corresponding to a reproduction of the specific contribution alone;
- determining (120, 130, 140) a parametric model of a specific signal (V^y_p) corresponding to the specific contribution alone, taking into account a fundamental-frequency correction for each time frame;
- determining (150) a parametric model of a background sound signal (V^z_p) corresponding to the background sound contribution alone;
- estimating (170, 190) an intermediate specific signal (V^y_i) and an intermediate background sound signal (V^z_i), by adjusting the parameters of the models and by using the mixing acoustic signal (x(t)); and,
- filtering (200) the mixing acoustic signal using the intermediate specific signal and the intermediate background sound signal, to obtain a specific acoustic signal (y(t)) and a background acoustic signal (z(t)).
[0002]
2. A method according to claim 1, characterized in that it comprises an initial step (115, 116) of transformation of a temporal acoustic signal into a time-frequency representation, and a final step (205, 206) of transformation of a time-frequency representation into a temporal acoustic signal, inverse to that of the initial transformation step, the determination, modeling, estimation and filtering steps being implemented in the frequency domain on time-frequency representations.
[0003]
3. A method according to claim 2, characterized in that the transformation of a temporal acoustic signal into a time-frequency representation is a logarithmic frequency scale transformation, in particular a constant Q transformation.
[0004]
4. A method according to any one of the preceding claims, wherein the step of determining a parametric model of a background sound signal (V^z_p) is based on a non-negative matrix decomposition.
[0005]
5. A method according to any one of the preceding claims, characterized in that the step of determining a parametric model of a specific signal also makes it possible to take into account a correction of a time shift between the guide signal and the specific contribution in the mixing spectrogram.
[0006]
6. A method according to any one of the preceding claims, characterized in that the step of determining a parametric model also makes it possible to take into account an equalization correction between the guide signal and the mixing signal.
[0007]
7. A method according to any one of the preceding claims, characterized in that the estimation step (170, 190) is based on the minimization of a cost function (C).
[0008]
8.- Method according to claim 7, characterized in that the cost function (C) uses a divergence (d), in particular the Itakura-Saito divergence.
[0009]
9. A method according to any one of claims 3 to 6, characterized in that the step of determining a pitch-correction parametric model of a spectrogram V^g of the guidance signal leads to a parametric guidance-signal spectrogram V^g_shifted of the form:
V^g_shifted = Σ_φ (↓_φ V^g) diag(P_φ,:)
where ↓_φ V^g corresponds to a shift of the spectrogram matrix V^g of the guidance signal by φ time/frequency bins downwards, diag(P_φ,:) is the diagonal matrix whose diagonal consists of the components of the φ-th row of the matrix P representative of a pitch shift, and Σ_φ is the summation over all the values of φ.
[0010]
10. A method according to claim 9 and claim 5 in combination, characterized in that the step of estimating a parametric model with pitch correction and synchronization correction of a spectrogram of the guidance signal V^g leads to a parametric guidance-signal spectrogram V^g_sync of the form:
V^g_sync = V^g_shifted S
where the matrix S is representative of a synchronization and is such that there exists an integer w such that, for any pair of frames (t1, t2), if |t1 − t2| > w then s_{t1,t2} = 0.
[0011]
11. A method according to claim 10 and claim 6 in combination, characterized in that the step of estimating a parametric model with pitch correction, synchronization correction and equalization correction of a spectrogram of the guidance signal V^g leads to a parametric guidance-signal spectrogram V^g_p of the form:
V^g_p = diag(E) (Σ_φ (↓_φ V^g) diag(P_φ,:)) S
where diag(E) is a diagonal matrix representative of an equalization whose diagonal consists of the components of the vector E.
[0012]
12. A method according to any one of claims 9 to 11, characterized in that the estimation step is iterative and, denoting by V̂ the overall model of the mixture spectrogram, implements the following update rules:
P_φ,: ← P_φ,: ⊙ [E^T ((↓_φ V^g) ⊙ ((V^x ⊙ V̂.^(−2)) S^T))] / [E^T ((↓_φ V^g) ⊙ (V̂.^(−1) S^T))]
for the pitch correction;
S ← S ⊙ [(Σ_φ diag(E) (↓_φ V^g) diag(P_φ,:))^T (V^x ⊙ V̂.^(−2))] / [(Σ_φ diag(E) (↓_φ V^g) diag(P_φ,:))^T V̂.^(−1)]
for the synchronization correction, when taken into account; and,
E ← E ⊙ [(((Σ_φ (↓_φ V^g) diag(P_φ,:)) S) ⊙ V^x ⊙ V̂.^(−2)) 1_T] / [(((Σ_φ (↓_φ V^g) diag(P_φ,:)) S) ⊙ V̂.^(−1)) 1_T]
for the equalization correction, when taken into account,
where ⊙ is the operator corresponding to the element-wise product between matrices (or vectors), the fraction bar denotes the element-wise division, .^α is the operator corresponding to the element-wise exponentiation of a matrix by a scalar, (.)^T is the transpose of a matrix, and 1_T is a T x 1 vector whose elements are all equal to 1.
[0013]
13.- Method according to any one of the preceding claims, characterized in that it comprises a first estimation step (170) and a tracking step (180), the tracking step consisting in optimizing a value of each parameter of the parametric models obtained at the output of the first estimation step (170).
[0014]
14.- Method according to claim 13, characterized in that it comprises a second estimation step (190), the optimized value obtained from the tracking step (180) being taken as initial value of the corresponding parameter in the second estimation step.
[0015]
15.- Method according to any one of the preceding claims, characterized in that the filtering step implements a Wiener filtering.
[0016]
16. A system for implementing a separation method according to any one of the preceding claims, characterized in that it comprises:
- means (16) for acquiring a guide signal;
- a module (30) for determining a parametric model of a dialogue signal;
- a module (40) for determining a parametric model of a background sound signal;
- a module (60) for estimating an intermediate dialogue signal and an intermediate background sound signal from a mixing signal (x(t)); and,
- a filtering module (70) for generating a dialogue signal (y(t)) and a background sound signal (z(t)) from the mixing signal and the intermediate dialogue and background sound signals (V^y_i, V^z_i).
Similar technologies:
Publication number | Publication date | Patent title
EP2374124B1|2013-05-29|Advanced encoding of multi-channel digital audio signals
US8103511B2|2012-01-24|Multiple audio file processing method and system
FR3013885A1|2015-05-29|METHOD AND SYSTEM FOR SEPARATING SPECIFIC CONTRIBUTIONS AND SOUND BACKGROUND IN ACOUSTIC MIXING SIGNAL
US9767846B2|2017-09-19|Systems and methods for analyzing audio characteristics and generating a uniform soundtrack from multiple sources
CN110709924A|2020-01-17|Audio-visual speech separation
EP3040989B1|2018-10-17|Improved method of separation and computer program product
EP2987165A1|2016-02-24|Frame loss correction by weighted noise injection
Canazza et al.2009|Restoration of audio documents by means of extended Kalman filter
Kilgour et al.2018|Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms
Canazza2012|The digital curation of ethnic music audio archives: from preservation to restoration
Liutkus et al.2010|Informed source separation using latent components
US9058384B2|2015-06-16|System and method for identification of highly-variable vocalizations
KR102018286B1|2019-10-21|Method and Apparatus for Removing Speech Components in Sound Source
Liutkus et al.2010|Separation of music+ effects sound track from several international versions of the same movie
US10614827B1|2020-04-07|System and method for speech enhancement using dynamic noise profile estimation
Cabras et al.2010|Restoration of audio documents with low SNR: a NMF parameter estimation and perceptually motivated Bayesian suppression rule
Gaultier2019|Design and evaluation of sparse models and algorithms for audio inverse problems
Manilow et al.2017|Leveraging repetition to do audio imputation
Lopatka et al.2016|Improving listeners' experience for movie playback through enhancing dialogue clarity in soundtracks
Cabras et al.2010|The restoration of single channel audio recordings based on non-negative matrix factorization and perceptual suppression rule
FR2627887A1|1989-09-01|SPEECH RECOGNITION SYSTEM AND METHOD OF FORMING MODELS THAT CAN BE USED IN THIS SYSTEM
Czyżewski et al.2012|Online sound restoration for digital library applications
Braun et al.2021|Effect of noise suppression losses on speech distortion and ASR performance
Pretto et al.2021|A workflow and novel digital filters for compensating speed and equalization errors on digitized audio open-reel tapes
Liu et al.2021|Identification of fake stereo audio
Patent family:
Publication number | Publication date
US20150149183A1|2015-05-28|
US9633665B2|2017-04-25|
FR3013885B1|2017-03-24|
Cited documents:
Publication number | Filing date | Publication date | Applicant | Patent title

US6691082B1|1999-08-03|2004-02-10|Lucent Technologies Inc|Method and system for sub-band hybrid coding|
KR100716984B1|2004-10-26|2007-05-14|삼성전자주식회사|Apparatus and method for eliminating noise in a plurality of channel audio signal|
US8812322B2|2011-05-27|2014-08-19|Adobe Systems Incorporated|Semi-supervised source separation using non-negative techniques|
EP2898506B1|2012-09-21|2018-01-17|Dolby Laboratories Licensing Corporation|Layered approach to spatial audio coding|
FR3031225B1|2014-12-31|2018-02-02|Audionamix|IMPROVED SEPARATION METHOD AND COMPUTER PROGRAM PRODUCT|
TWI573133B|2015-04-15|2017-03-01|國立中央大學|Audio signal processing system and method|
EP3324407A1|2016-11-17|2018-05-23|Fraunhofer Gesellschaft zur Förderung der Angewand|Apparatus and method for decomposing an audio signal using a ratio as a separation characteristic|
EP3324406A1|2016-11-17|2018-05-23|Fraunhofer Gesellschaft zur Förderung der Angewand|Apparatus and method for decomposing an audio signal using a variable threshold|
WO2020081872A1|2018-10-18|2020-04-23|Warner Bros. Entertainment Inc.|Characterizing content for audio-video dubbing and other transformations|
CN113573136B|2021-09-23|2021-12-07|腾讯科技(深圳)有限公司|Video processing method, video processing device, computer equipment and storage medium|
Legal status:
2015-11-12| PLFP| Fee payment|Year of fee payment: 3 |
2016-09-28| PLFP| Fee payment|Year of fee payment: 4 |
2017-09-18| PLFP| Fee payment|Year of fee payment: 5 |
2018-09-19| PLFP| Fee payment|Year of fee payment: 6 |
2019-08-30| PLFP| Fee payment|Year of fee payment: 7 |
2021-08-06| ST| Notification of lapse|Effective date: 20210705 |
Priority:
Application number | Filing date | Patent title
FR1361792A|FR3013885B1|2013-11-28|2013-11-28|METHOD AND SYSTEM FOR SEPARATING SPECIFIC CONTRIBUTIONS AND SOUND BACKGROUND IN ACOUSTIC MIXING SIGNAL|FR1361792A| FR3013885B1|2013-11-28|2013-11-28|METHOD AND SYSTEM FOR SEPARATING SPECIFIC CONTRIBUTIONS AND SOUND BACKGROUND IN ACOUSTIC MIXING SIGNAL|
US14/555,230| US9633665B2|2013-11-28|2014-11-26|Process and associated system for separating a specified component and an audio background component from an audio mixture signal|